Multimedia Event Detection (MED) Evaluation Task
Authors
Abstract
The Mayachitra Inc. team submitted runs for the TRECVID 2010 Multimedia Event Detection (MED) pilot task evaluation. In this paper, we describe a preliminary set of results. The focus of this experiment was to implement an end-to-end pilot system for multimedia event detection that (i) processes video and extracts and stores state-of-the-art video descriptors, (ii) learns complex event models, and (iii) evaluates them on the test set in an efficient and effective manner. In this preliminary report, we summarize our findings on the performance of one of the important system components: the state-of-the-art activity detection approach. We have submitted two runs to NIST:

• c_raw_1: max-type fusion of the scores from binary detectors trained on a subset of visual words.
• p_base_1: weighted fusion of the individual scores from the activity detectors.

and evaluated one additional run:

• c_sel_1: cross-validation fusion of the activity detectors trained on the expanded set.

The performance of the runs varied significantly based on the training selection, and diversifying the training set improves the detection scores. Overall, the activity recognition component has shown clear potential within the overall event detection system for user-generated video collections. We will present a detailed analysis in the final notebook paper.

1. ACTION DESCRIPTORS

Following the explosion of user-created video content, and the lack of tools to efficiently index and retrieve it, the research community has made significant progress in advancing the use of static descriptors (i.e., visual descriptors extracted from video keyframes) to detect objects and scenes in automatic annotation pipelines and to connect them to the events they describe [1, 2]. To describe a complete event, descriptors need to capture the scene, the objects and their relations, and the actual activity or action. The effort to incorporate activity recognition analysis into scalable video analysis systems is still in its infancy. Lately, the computer vision community has reported favorable results in the action recognition domain by extending traditional object recognition approaches to the spatio-temporal domain of video [3, 4]. Actions are captured as spatio-temporal patterns in the local descriptor space. To effectively capture actions in user-generated video content, such as YouTube videos, we must consider the following:

• The size of the video archive is overwhelming.
• User-created video content is widely diverse in content capture (camera settings), content presentation (event flow), and content editing.
• The actions to be detected vary in the scale of detail that needs to be captured.

This boils down to the following demands on the selection of a state-of-the-art spatio-temporal descriptor: (i) the descriptor extraction needs to be efficient, (ii) the extracted features need to be time and scale invariant, and (iii) the extracted features need to capture the rich semantics of action events in video archives.

For the TRECVID MED pilot task, we use the dense, scale-invariant, spatio-temporal Hes-STIP detector of Willems et al. [5]. This detector responds to spatio-temporal blobs within a video, based on an approximation of the determinant of the Hessian. These features are scale-invariant (in both the temporal and spatial domains) and relatively dense compared with other spatio-temporal features.
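To make the detector response concrete, the following minimal sketch computes a single-scale determinant-of-Hessian strength over a grayscale video volume, previewing the formulation given in Section 1.1. It uses exact Gaussian derivatives from SciPy rather than the box-filter approximation over an integral video used in [5, 6]; the function name, axis ordering, and default scale values are illustrative assumptions, not the actual Hes-STIP implementation.

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def hessian_strength(video, sigma=2.0, tau=2.0):
        # video: float array of shape (T, H, W); sigma/tau are the spatial and
        # temporal Gaussian scales. Exact Gaussian derivatives stand in for the
        # box-filter approximation of [5, 6].
        f = video.astype(np.float64)
        scales = (tau, sigma, sigma)            # per-axis scales: (t, y, x)

        def deriv(ot, oy, ox):
            # second-order Gaussian derivative of the scale space L
            return gaussian_filter(f, sigma=scales, order=(ot, oy, ox))

        Lxx, Lyy, Ltt = deriv(0, 0, 2), deriv(0, 2, 0), deriv(2, 0, 0)
        Lxy, Lxt, Lyt = deriv(0, 1, 1), deriv(1, 0, 1), deriv(1, 1, 0)

        # determinant of the symmetric 3x3 spatio-temporal Hessian at every voxel;
        # local maxima of this volume are candidate interest points
        return (Lxx * (Lyy * Ltt - Lyt * Lyt)
                - Lxy * (Lxy * Ltt - Lyt * Lxt)
                + Lxt * (Lxy * Lyt - Lyy * Lxt))

In the full detector, this measure is evaluated over multiple spatial and temporal scales and local maxima are retained as interest points.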
1.1. Spatio-temporal interest point detection

The spatio-temporal scale space L is defined by convolving a spatio-temporal signal f with a Gaussian kernel g(·; σ², τ²), where σ represents the spatial and τ the temporal scale:

    L(·; σ², τ²) = g(·; σ², τ²) ∗ f(·)

Willems et al. [5] use the Hessian matrix for the point detection task. The Hessian matrix H is defined as the square matrix of all second-order partial derivatives of L:

    H = ( Lxx  Lxy  Lxt
          Lyx  Lyy  Lyt
          Ltx  Lty  Ltt )

The Gaussian second-order derivatives in the spatio-temporal space (Dxx, Dyy, Dtt, Dxy, Dtx, and Dty) can be approximated using box filters [6]. All six derivatives can be computed from rotated versions of only two different types of box filters, and the box filters can be evaluated efficiently using an integral representation of the video [7]. The determinant of the matrix H defines the strength of an interest point at a given scale.

1.2. SURF3D Descriptor

The descriptor used by Willems et al. is an extension of the 2D SURF image descriptor [6]. To describe an interest point, a rectangular volume with dimensions sσ × sσ × sτ is defined, where τ represents the temporal scale, σ the spatial scale, and s a magnification factor. The descriptor volume is divided into M × M × N subregions. Within each of these sub-volumes, three axis-aligned box-filter responses dx, dy, dt are computed at uniformly spaced sample points, and every subregion is represented by the vector v = (∑dx, ∑dy, ∑dt). The resulting descriptor is invariant to spatial rotation if the dominant orientation is taken into account, and it is invariant to spatial and temporal scale if the box filters used have size σ × σ × τ. We use this dense, scale-invariant, spatio-temporal Hes-STIP detector and SURF3D descriptor in our activity detection pipeline.

2. ACTIVITY RECOGNITION

An event for MED 2010 is "an activity-centered happening that involves people engaged in process-driven actions with other people and/or objects at a specific place and time". In this preliminary report, we present the activity recognition component of our system.
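As a rough illustration of how individual detector scores are combined in the submitted runs (max-type fusion for c_raw_1, weighted fusion for p_base_1), the following sketch fuses per-clip scores from several binary event detectors. The array shapes, example scores, and weights are hypothetical placeholders; this is a minimal sketch of the two fusion rules, not the actual system code.

    import numpy as np

    def max_fusion(scores):
        # max-type fusion: the event score of a clip is the strongest
        # response among the individual detectors
        return np.max(np.asarray(scores, dtype=float), axis=0)

    def weighted_fusion(scores, weights):
        # weighted fusion: a normalised weighted average of the
        # individual detector scores
        w = np.asarray(weights, dtype=float)
        return np.average(np.asarray(scores, dtype=float), axis=0, weights=w / w.sum())

    # hypothetical scores: 3 detectors x 4 test clips
    per_detector = np.array([[0.10, 0.80, 0.35, 0.05],
                             [0.20, 0.60, 0.55, 0.10],
                             [0.05, 0.90, 0.30, 0.15]])

    print(max_fusion(per_detector))                        # c_raw_1-style fusion
    print(weighted_fusion(per_detector, [0.5, 0.3, 0.2]))  # p_base_1-style fusion

The cross-validation fusion of c_sel_1 could, for example, select such weights on held-out folds; the exact scheme is not detailed in this preliminary report.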
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2010